Bilingual Text Classification using the IBM 1 Translation Model
نویسندگان
چکیده
Abstract Manual categorisation of documents is a time-consuming task that has been significantly alleviated with the deployment of automatic and machine-aided text categorisation systems. However, the proliferation of multilingual documentation has become a common phenomenon in many international organisations, while most of the current systems has focused on the categorisation of monolingual text. It has been recently shown that the inherent redundancy in bilingual documents can be effectively exploited by relatively simple, bilingual naive Bayes (multinomial) models. In this work, we present a refined version of these models in which this redundancy is explicitly captured by a combination of a unigram (multinomial) model and the well-known IBM 1 translation model. The proposed model is evaluated on two bilingual classification tasks and compared to previous work.
منابع مشابه
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...
متن کاملBilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification
In many languages, sparse availability of resources causes numerous challenges for textual analysis tasks. Text classification is one of such standard tasks that is hindered due to limited availability of label information in lowresource languages. Transferring knowledge (i.e. label information) from high-resource to low-resource languages might improve text classification as compared to the ot...
متن کاملAd hoc and Multilingual Information Retrieval at IBM
IBM participated in two tracks at TREC-7: ad hoc and cross-language. In the adhoc task we contrasted the performance of two di erent query expansion techniques: local context analysis and probabilistic model. Two themes characterize IBM's participation in the CLIR track at TREC-7. The rst is the use of statistical methods. In order to use the document translation approach, we built a fast (tran...
متن کاملA Probabilistic Model of Machine Translation
A probabilistic model for computer-based generation of a machine translation system on the basis of English-Russian parallel text corpora is suggested. The model is trained using parallel text corpora with pre-aligned source and target sentences. The training of the model results in a bilingual dictionary of words and " word blocks " with relevant translation probability 1 Introduction.
متن کاملTranslation Quality Assessment of English Equivalents of Persian Proper Nouns: A case of bilingual tourist signposts in Isfahan
Abstract This study evaluated the translation quality of English equivalents of Persian proper nouns in the tourist signs and bilingual boards in Isfahan. To find different errors in the translations of the bilingual boards and tourist signs, the data were collected directly by taking picture or writing exactly from the available tourist signs and bilingual boards. Then, the errors were assesse...
متن کامل